Baltimore Crime Data Analysis

Date: 12/6/2021

I. Introduction

The main aim of this project is to tell a story about crime. Many types of crime occur at various times and places, and this project provides insights into crime under various conditions, such as the time of day, location, weather, and season of the year. It also aims to determine whether we can predict occurrences of crime based on previously available data, and with what accuracy.

A. Research Questions

  1. Can we predict whether a violent crime will occur at a given time and place based on past data?

  2. Can we find a peak in the occurrence of crime during a certain month? Is there a trend in crime occurrence across the months? Do different seasons or months affect the category or type of crime taking place?

  3. Can we observe any changes in the occurrence of crime over the course of a day? Is the crime rate higher during the daytime than at night?

    Hypothesis: We expect to see an increase in the crime rate at night due to low visibility and fewer pedestrians on the street. In addition, low lighting at night leads to poorer-quality surveillance, which can motivate the occurrence of a crime.

  4. What is the density of occurrences of violent crimes in Baltimore? Are there clusters of crime occurrences from which we can draw inferences or trends?

    Hypothesis: We expect to see an increased crime rate in the business districts because of the amount of footfall those areas experience at any time of day.

  5. Can temperature affect violent crime occurrences? If so, how?

    Hypothesis: During winter, outdoor temperatures are usually lower, so people mostly prefer to stay indoors. Based on the assumption that violent crimes happen when fewer people are around, we expect to see an increase in the violent crime rate on days with lower temperatures.

The answers to these questions are important because they will give the police department more information about how frequently crimes occur at various times of day and about the locations where crimes occur most. This will allow the department to allocate resources accordingly and help reduce crime in one of the most crime-ridden cities in the United States.

B. Dataset Description

We obtained the crime dataset from the Baltimore Police Department website (https://www.baltimorepolice.org/crime-stats), and we used a Python API provided by https://www.worldweatheronline.com/ to retrieve weather data from 2013 onwards.

This will be a data-analysis-focused project.

Why Data Analysis?

The project should be graded more heavily on data analysis than on data processing, because the questions we have posed require us to go beyond routine analysis. In this project, we used K-means clustering to find the centers of various clusters and group our data around them. We also extracted various features for our model and built a Random Forest Regressor to predict the occurrence of violent crimes from historical data. This is why our work goes beyond basic data analysis.

II. Data Processing

A. Data Cleaning and Analyzing Correlation between Missing Values

Data Cleaning

While examining the columns and their values, we conclude that certain columns are redundant and not needed for our analysis. So we drop these columns, specifying inplace=True so the DataFrame is modified in place rather than a modified copy being returned.
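A minimal sketch of this step; the column names below are hypothetical stand-ins for whichever columns the real dataset makes redundant:

```python
import pandas as pd

# Toy frame with two hypothetical redundant columns (RowID, Shape)
df = pd.DataFrame({
    "CrimeDateTime": ["2021/01/01 12:00:00"],
    "Description": ["ROBBERY"],
    "RowID": [1],        # e.g. a redundant identifier column
    "Shape": ["Point"],  # e.g. a redundant geometry column
})

# inplace=True modifies df directly instead of returning a modified copy
df.drop(columns=["RowID", "Shape"], inplace=True)
```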

Analysis of correlation between the missing values of different columns

Observations:
  1. We observe a strong correlation between missing values in Premise and Inside_Outside. This suggests that when a row lacks information about the premise where the crime occurred, it is also difficult to ascertain whether the crime occurred inside or outside.

  2. Similarly, Post, District, and Neighborhood have missing values that are strongly correlated with each other: without information about the Post where the crime occurred, it is difficult to ascertain the Neighborhood and District.

  3. We also observe that Longitude and Latitude are strongly correlated, meaning that if Longitude is missing, Latitude is most likely missing as well.
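The missingness correlation behind these observations can be sketched as follows; the toy rows stand in for the real dataset, with column names taken from the observations above:

```python
import pandas as pd

# Toy data: Premise and Inside_Outside are missing on the same rows
df = pd.DataFrame({
    "Premise": ["STREET", None, None, "ROW/TOWNHO", None],
    "Inside_Outside": ["Outside", None, None, "Inside", None],
    "District": ["CENTRAL", "EASTERN", None, "WESTERN", "CENTRAL"],
})

# 1 where a value is missing, 0 otherwise, then an ordinary correlation
missing_corr = df.isnull().astype(int).corr()
```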

Dropping null values in a significant feature: CrimeDateTime

Now we come to our first column with missing values and examine it in more detail.

We observe 24 missing values in the CrimeDateTime column. The date and time an incident occurred is an extremely important feature in our model for predicting where and when a violent crime occurred, so we cannot impute values by forward or backward filling, which could skew the model's results. Since only 24 rows have missing values, we drop them.
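A sketch of the drop, on toy rows:

```python
import pandas as pd

# Drop the rows whose CrimeDateTime is missing instead of imputing a date
df = pd.DataFrame({
    "CrimeDateTime": ["2021/01/01 12:00:00", None, "2021/03/05 23:15:00"],
    "Description": ["ROBBERY", "ASSAULT", "BURGLARY"],
})
df = df.dropna(subset=["CrimeDateTime"]).reset_index(drop=True)
```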

Converting Crime_Date and Crime_Time to datetime objects for efficient value extraction and easy manipulation of dates and times
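A sketch of the conversion, assuming the date/time string formats shown (the real columns may vary slightly):

```python
import pandas as pd

# Parse the split columns into datetime objects so that components such as
# year, month and hour come out via the .dt accessor
crime_date = pd.to_datetime(pd.Series(["2021/01/01", "2019/11/23"]))
crime_time = pd.to_datetime(pd.Series(["12:00:00", "08:30:00"]), format="%H:%M:%S")

years = crime_date.dt.year
hours = crime_time.dt.hour
```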

Analyzing the timeline of crime incidents

To analyze the number of crime incidents reported each year, we extract the year of the crime from the Crime_Date column using the vectorized string method str.extract() with a regular expression that matches the year format.
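This extraction can be sketched as follows; the regex capture group pulls the four-digit year out of each date string in one vectorized pass:

```python
import pandas as pd

# Toy Crime_Date values in the yyyy/mm/dd layout used above
crime_date = pd.Series(["2021/01/01", "2019/11/23", "2015/06/07"])

# str.extract returns a DataFrame with one column per capture group;
# [0] selects the first (and only) group
years = crime_date.str.extract(r"(\d{4})")[0]
```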

Splitting the Date-Time column into Date and Time Columns

We split the CrimeDateTime column into Crime_Date and Crime_Time columns so we can analyze them separately. We use the vectorized string method str.split(), treating the space between the two values as the delimiter and specifying expand=True so the function returns a DataFrame.
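A minimal sketch of the split:

```python
import pandas as pd

df = pd.DataFrame({"CrimeDateTime": ["2021/01/01 12:00:00", "2019/11/23 08:30:00"]})

# expand=True makes str.split() return a two-column DataFrame,
# one column per side of the space delimiter
parts = df["CrimeDateTime"].str.split(" ", expand=True)
df["Crime_Date"], df["Crime_Time"] = parts[0], parts[1]
```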

Cleaning Post, District, Neighborhood

Observation:

  1. We observe that Post and District each have 742 missing values, and Neighborhood has 764 null values. So we look further into the rows where Post is null.

We observe that Neighborhood and District also have many null values where Post is null, so we dig deeper to verify this.

We observe that Post, District, and Neighborhood have parallel null values, so we cannot impute the Post column using Neighborhood or District. We therefore drop the null values and move on to the other columns.

Cleaning the Inside_Outside Column

Now we move on to the next column, Inside_Outside. Our plan of action for this column:

  1. The column should contain only two unique values, but it contains more, so we will fix that.
  2. It is a categorical variable, which we will convert to numerical indicator variables.

We observe that this column has four unique values where it should have only two, so we replace I and O with Inside and Outside.

Now, we have three columns :

  1. Inside - This will be 1 if the incident happened indoors.
  2. Outside - This will be 1 if the incident happened outdoors.
  3. Inside_Outside_Null - This will be 1 if the value in the original column was null.
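The recoding and the three indicator columns can be sketched as:

```python
import pandas as pd

# Collapse the four raw codes to two labels
df = pd.DataFrame({"Inside_Outside": ["I", "Outside", "O", None, "Inside"]})
df["Inside_Outside"] = df["Inside_Outside"].replace({"I": "Inside", "O": "Outside"})

# Build the three indicator columns listed above
df["Inside"] = (df["Inside_Outside"] == "Inside").astype(int)
df["Outside"] = (df["Inside_Outside"] == "Outside").astype(int)
df["Inside_Outside_Null"] = df["Inside_Outside"].isnull().astype(int)
```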

Handling Missing Values in the Inside_Outside Column

Observations

  1. There are 47786 null values in the Inside/Outside (I/O) column, about 14% of all rows.
  2. The dataframe above shows that missing I/O values occur in every crime category except homicide and shooting.
  3. Dropping all rows with a null I/O value would remove nearly 14% of the dataset, including many violent-crime rows (such as all kinds of robbery) that are important for further analysis.
  4. The better approach is to keep all rows with a null value in the Inside_Outside column.
  5. In the meantime, we check whether the Premise column can be used to fill the null values in Inside_Outside.

Cleaning the Premise Column

Observation:
  1. There are 47786 NaNs in the Inside_Outside column and 47924 in the Premise column.
  2. The null values make up around 14% of all rows.
  3. When the Inside_Outside value is null, the Premise value is always null.
  4. There are 138 rows where Premise is null but Inside_Outside is not.
  5. Therefore we cannot use Premise to fill the missing values in Inside_Outside.

Grouping Crime Types

The goal of our project is to predict whether a violent crime will happen at a given time and location. For this, we group the crimes into the categories below and label them accordingly.

Categories:

  1. Violent Crimes
  2. Auto-Related Crimes
  3. Others
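A sketch of the grouping; the description-to-category mapping below is an assumption based on the three buckets above, not the notebook's exact grouping:

```python
import pandas as pd

# Hypothetical mapping from crime descriptions to the three categories
category_map = {
    "HOMICIDE": "Violent", "SHOOTING": "Violent", "RAPE": "Violent",
    "AGG. ASSAULT": "Violent", "ROBBERY - STREET": "Violent",
    "AUTO THEFT": "Auto", "LARCENY FROM AUTO": "Auto",
    "ROBBERY - CARJACKING": "Auto",
}

df = pd.DataFrame({"Description": ["HOMICIDE", "AUTO THEFT", "BURGLARY"]})
# Anything not in the mapping falls into the Other bucket
df["Crime_Category"] = df["Description"].map(category_map).fillna("Other")
```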

Integrating weather data

World Weather Online is a website that provides global weather forecasts and a weather API for accessing weather data for any location in a range of formats, such as XML, CSV, and JSON. We use the retrieve_hist_data() function from the wwo-hist Python package, which wraps the weather API from World Weather Online.

We specified the following arguments to retrieve_hist_data to obtain historical weather data:

  1. location - specified using a city/town name, a postal/zip code, or the latitude and longitude of a location.
    We specified the location as 'Baltimore', the city name.
  2. start_date - 1st January 2013.
  3. end_date - the date of the last crime record in our data, 24th September 2021.
  4. API key - obtained from the World Weather Online website to access the data.
  5. frequency - we specified the frequency of the data to be collected as 2 hours.
  6. export_csv=True - to save the data as a CSV file in the current working directory.
  7. store_df=True - to store the dataframe(s) obtained as a list in the workspace.

API Call for Weather Data
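The call described above can be sketched as follows. The argument order follows the wwo-hist package; API_KEY is a placeholder, and the request only fires when a real key is supplied:

```python
# Parameters for the historical weather download (see the list above)
FREQUENCY = 2                  # hours between observations
START_DATE = "01-JAN-2013"
END_DATE = "24-SEP-2021"
API_KEY = ""                   # placeholder: key from worldweatheronline.com
LOCATION_LIST = ["Baltimore"]

if API_KEY:
    from wwo_hist import retrieve_hist_data
    # export_csv writes one CSV per location into the working directory;
    # store_df also returns the dataframes as a list
    hist_weather = retrieve_hist_data(
        API_KEY, LOCATION_LIST, START_DATE, END_DATE, FREQUENCY,
        location_label=False, export_csv=True, store_df=True,
    )
```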

III. Data Analysis

1. Analyzing the Crime Occurrence by Month

In this section, we want to find whether there is a peak in crime occurrence during a certain month and look for a trend in crime occurrence across the months.
We also want to check whether different seasons or months affect the category or type of crimes taking place.
We first created three variables with the number of crimes in each category (Violent, Automobile, and Other), grouped them by month, and used them to plot a stacked bar graph with the plotly.graph_objects module.
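A sketch of the monthly counts and the stacked bar chart; a few toy rows stand in for the cleaned dataset, and plotly is only used if it is available:

```python
import pandas as pd

df = pd.DataFrame({
    "Crime_Date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-11"]),
    "Crime_Category": ["Violent", "Auto", "Violent"],
})

# Crime counts per month (rows) per category (columns)
monthly = pd.crosstab(df["Crime_Date"].dt.month, df["Crime_Category"])

try:
    import plotly.graph_objects as go
    # One Bar trace per category, stacked on the month axis
    fig = go.Figure([go.Bar(name=c, x=monthly.index, y=monthly[c]) for c in monthly])
    fig.update_layout(barmode="stack")
except ImportError:
    pass  # plotly is needed only for the chart itself
```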

Inferences

2. Analyzing the Crime Distribution in a Day

We want to understand whether crime occurrence changes over the course of a day and whether any peak time can be observed. We hypothesized that the crime rate is generally higher at night than during the day.
To see the pattern of crime occurrence within a day, we first group the data frame by hour. As above, we classify all crime instances into three types based on the description to gain further insight, and display them in a stacked bar chart.

Inferences

3. Analyzing the Density of Violent Crime in Baltimore

What is the density of occurrences of violent crimes in Baltimore? Is there a cluster of crime occurrences from which we can draw inferences or trends?

One of the research questions of our project is to find the density of occurrences of violent crimes at various locations in Baltimore. For this, we create a new dataframe called isviolent containing only violent crimes. We then create a heatmap of violent crime occurrences using Folium's HeatMap() function. The heatmap lets us see clusters in the data, from which we can derive trends.
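A sketch of the heatmap step; the coordinates are made-up points near downtown Baltimore, and folium is only used if it is available:

```python
import pandas as pd

# Toy stand-in for the isviolent dataframe
isviolent = pd.DataFrame({
    "Latitude": [39.290, 39.293, 39.289],
    "Longitude": [-76.612, -76.615, -76.610],
})

# HeatMap expects a list of [lat, lon] pairs
heat_points = isviolent[["Latitude", "Longitude"]].values.tolist()

try:
    import folium
    from folium.plugins import HeatMap
    crime_map = folium.Map(location=[39.29, -76.61], zoom_start=13)
    HeatMap(heat_points).add_to(crime_map)
except ImportError:
    pass  # folium is needed only to render the map
```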

We observe that many violent crimes occur in and around Charles Center, as seen from the warm end of the heatmap's colour spectrum. We can infer that this is because it lies in Baltimore's popular downtown business district: many people come to work there, producing heavy footfall, and some of them become victims of violent crimes.

4. Finding Out Temperature Sensitive Crimes

Because there are 90 distinct average-temperature values in the data, we decide to divide them into 15 bins.

Because more than one crime can happen on a single day, we decided to standardize the data using per-date counts. For this purpose, we drop duplicate dates and count how many distinct dates there are in the whole dataset.

We create a dataframe showing only the total number of crimes in each bin, which takes several steps.

The dataframe above shows the total number of each crime in each temperature bin.
To visualize temperature sensitivity, the next step is to standardize the data as percentages using the number of distinct dates in each temperature bin.
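The binning and normalisation can be sketched as follows, on toy values (the real column holding the average temperature may be named differently):

```python
import pandas as pd

df = pd.DataFrame({
    "Crime_Date": ["2021-01-01", "2021-01-01", "2021-07-04"],
    "avg_temp": [30, 30, 88],
})

# 15 equal-width temperature bins
df["temp_bin"] = pd.cut(df["avg_temp"], bins=15)

# Distinct dates per bin, so multi-crime days are not double counted
dates_per_bin = df.drop_duplicates("Crime_Date").groupby("temp_bin", observed=True).size()
crimes_per_bin = df.groupby("temp_bin", observed=True).size()

# Crimes per day for each temperature bin
crimes_per_date = crimes_per_bin / dates_per_bin
```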

Observations and Inferences about Temperature and Crimes

We consider four major violent crimes in our analysis to understand the effects of varying temperature. Based on our hypothesis, we expected more violent crimes in winter; however, after the analysis we conclude that this is not the case, as the crime rate increases with temperature.

5. Clustering on Postal Code (K-means) and Model Building

In our dataset, we have columns for the Latitude and the Longitude, which we use to form clusters of areas. For this, we use the K-means clustering algorithm, which assigns each point to the nearest centroid and updates each centroid as the mean of its assigned points, minimizing squared Euclidean distance.

  1. We will use the sklearn package to implement K-means clustering.
  2. There are 127 unique zip codes in our Baltimore dataset. So we aim to create 127 clusters.
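A sketch of the clustering step; the real notebook uses n_clusters=127 (one per zip code), but two clusters on a handful of toy coordinates show the mechanics:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of toy (lat, lon) points
coords = np.array([
    [39.29, -76.61], [39.30, -76.62], [39.31, -76.60],  # downtown-ish
    [39.55, -76.35], [39.56, -76.36], [39.54, -76.34],  # a second area
])

# fit() places the centroids; labels_ gives each point's cluster id
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
labels = kmeans.labels_
```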

Feature Extraction for the Model

There are a lot of feature columns that need to be calculated for the prediction model. We calculate the following columns:

  1. First, we group the dataset by cluster label and crime date.
  2. Then, for each cluster, we calculate the sum of violent crimes in the 120 days, 30 days, 7 days, and 1 day before a given date.
  3. Similarly, we calculate the sums of auto-related and other crimes over the same windows for each cluster.
  4. Then we merge with the weather data to obtain precipitation, cloud cover, and minimum and maximum temperature as additional independent variables.
  5. Our dependent variable is the binary isViolent column.
  6. We use a random forest regressor model to predict the occurrence of a violent crime at a location at a given time.

Since we have grouped the data and summed the violent crimes, the isViolent column in our Y variable holds the total number of violent crimes for a given date. To turn it into a binary 0/1 value, we apply a condition to obtain the outcome variable.
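The trailing-window feature from step 2 above can be sketched as follows, on one toy cluster; the column and feature names are illustrative, not the notebook's exact ones:

```python
import pandas as pd

# One cluster, four consecutive days, daily violent-crime counts
daily = pd.DataFrame({
    "Cluster": [0, 0, 0, 0],
    "Crime_Date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
    "violent": [1, 0, 2, 1],
})

# Per-cluster trailing sums, shifted by one day so that each date
# only sees its own past (no leakage from the target day)
for window in (1, 7, 30, 120):
    daily[f"violent_prev_{window}d"] = (
        daily.groupby("Cluster")["violent"]
             .transform(lambda s, w=window: s.rolling(w, min_periods=1).sum().shift(1))
    )
```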

Random Forest

Now we use a prediction model to predict the occurrence of a violent crime at a given time and location. For this, we use a Random Forest Regressor from the sklearn package.

  1. A random forest regressor is a supervised learning model that performs well on non-linear problems. Its trees are built independently of one another, and their predictions are averaged.

  2. We input our independent and dependent variables for our model.

  3. X holds our independent variables and Y is our dependent variable.

  4. We split the dataset into training and test sets; the test set is needed to validate the model's predictions.

  5. We evaluate the model's performance using several metrics for regression models.
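The steps above can be sketched on synthetic data; in the real model, X holds the lagged crime counts plus the weather columns, and y the binary isViolent flag:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 4 features, binary target driven by the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hold out a test set to validate the predictions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
preds = model.predict(X_test)  # continuous scores; thresholded at 0.5 to get 0/1
```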

Performance of the Model

To measure the accuracy and precision of the prediction model, we use various Key Performance Indicators (KPIs), which give us an approximation of the magnitude of the error, i.e. the deviation of the predicted values from the actual values. The KPIs we used to measure the accuracy of our model are the MSE, the RMSE, and the MAE.

The RMSE, MAE, and MSE can all be improved by enhancing the performance of the model.
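The three KPIs can be computed as follows, on toy predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy actual labels and predicted scores
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0.1, 0.8, 0.6, 0.2])

mse = mean_squared_error(y_true, y_pred)   # mean squared error
rmse = np.sqrt(mse)                        # root mean squared error
mae = mean_absolute_error(y_true, y_pred)  # mean absolute error
```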

Observations :

  1. Using all of the independent variables we calculated, we obtained an accuracy score of approximately 50.3%.
  2. The model has significant limitations, as it achieved an accuracy of only 51.24%.
  3. Further feature selection, and possibly an XGBoost classifier, will be needed to enhance the model's performance.
  4. Predicting violent crime by location and time will help the police department allocate its resources.
  5. The model can be improved by enhancing the features and incorporating more information, such as the locations of police stations, to make the process more efficient.

IV. Conclusion

This project aims to identify important features for crime prediction in the Baltimore area, to help the police force plan and allocate resources to tackle crime.

Based on our analysis, we find that weather and time are factors that have an impact on crime occurrence: